New Insights into the Genetic Regulation of Plasmodium Falciparum Obtained by Bayesian Modeling

The most fatal and prevalent form of malaria is caused by the bloodborne pathogen Plasmodium falciparum (henceforth P.f). Annually, approximately three million people died of malaria. Despite P.f devastivating effect globally, the vast majority of its proteins have not been characterized experimentally. In this work, we provide computational insight that explore the modalities of the regulation for some important group of genes of P.f, namely components of the glycolytic pathway, and those involved in apicoplast metabolism. Glycolysis is a crucial pathway in the maintenance of the parasite while the recently discovered apicoplast contains a range of metabolic pathways and housekeeping processes that differ radically to those of the host, which makes it ideal for drug therapy. We have been able to validate some of our findings from available literature and therefore provide a basis to give theoretical insight for some genes regulations, which has not been characterized experimentally.


Introduction
The most fatal and prevalent form of malaria is caused by the bloodborne pathogen Plasmodium falciparum (henceforth P.f). Annually, approximately three million people died of malaria. Also, hundreds of millions of people in a year become clinically ill. The negative influence of these results is huge and its socioeconomic impact is beyond measure. This influence is particularly prominent in the Africa continent, where an estimated US$12 billion is been lost yearly (2; 9). Reports has shown that the parasites is growing resistance to existing drugs. Therefore, there is a huge and urgent need to discover and validate new drug or vaccine targets to enable the development of new treatments for malaria (4). The ability to discover these drug or vaccine targets can only be enhanced from our understanding of the detailed conceptual view of the gene regulatory circuitry of P.f.
Previously, two computational studies have attempted to characterize the genetic regulation of glycolytic genes and those specific for the apicoplast, namely, Khanin and Wit, 2004 and Barrera et al. 2004. Their models could only show whether two genes interacted but were unable to explain the modalities of the genes regulation. In this work, we provide computational insight into the regulation of Plasmodium falciparum genes in glycolysis and apicoplast pathways using the time-series gene expression measurements of the intraerythrocytic development cycle (dataset of Bozdech et al. (5)). Our work can be said to be the first study that looks closely at important group of genes in P.f and attempt to elucidate the genes regulatory connections. The dataset of (5) has been shown (25) to compare very well with the other microarray dataset of (16), a recent comparison of this dataset to the results on other strains of P.f can be found in (18). And since the two existing computational genetic networks for P.f were induced using Bozdech et al. results, we made this data our basic dataset in this study. For the reconstruction of genetic regulatory interactions from the data, we applied Bayesian inference of the probabilistic model, previously developed and tested on the yeast S. cerevisiae (3). This model is based on the biologically motivated Boolean logic semantics, but a probabilistic one i.e. giving the possibility to incorporate uncertainty about the noisy data and the noisy process of genetic regulation. The framework allows for a particular gene to find a set of its regulators given a particular Boolean logic function governing this regulation. Boolean logic is a simple and particularly suitable way to model the working principles of the cis-regulatory elements. The 'OR' logic represents that a gene can be activated by one Please note that this article may not be used for commercial purposes. For further information please refer to the copyright statement at http://www.la-press.com/copyright.htm of a few different possible transcription factors. In case of 'OR-NOR' logic, a gene is regulated by a set of possible activators and a set of possible inhibitors. The gene is transcribed if and only if one of its possible activators is active and it is not repressed by one of its possible repressors. We employ this kind of probabilistic graphical models with Boolean logic semantics for modeling the genetic regulatory interactions of P.f.
The write-up is organized as follows. In Section 2, we describe the systems and methods applied in our modeling and analysis of the pathway and metabolism under consideration. Section 3 contains our results and discussion of the implication of these results. We end this paper in Section 4 with a conclusion and further-work.

The model of gene regulatory interactions
Here, we describe the probabilistic graphical model underlying our approach of inferring gene regulatory interactions of P.f. We assume that genes have only two states-active and not active, and model them with stochastic variables having binary values. The general structure of the gene interaction in our models is represented by a directed graph (see Fig. 1). We assume that the variable X i (regulator) can execute its influence on the variable Y (regulatee) independently of other possible regulators X 1 , …, X n of Y. The biological mechanism underlying this modeling assumption is the binding of protein transcribed by the regulator to the DNA of the regulatee. This process is not deterministic, rather each gene X i can regulate the gene Y with probability θ i and can fail to do this with probability 1 -θ i . In the graphical model intermediate variables I 1 , …, I n are introduced, through which the variables X 1 , …, X n execute their influence on a given common effect variable Y. Each intermediate variable I i has only one parent, the variable X i . It's probability distribution is defined as follows: given that X i = 1, I i takes the value 1 with probability θ i and the value 0 with probability 1 − θ i , respectively. Given that X i = 0, I i takes the value 0 with probability 1. The combined regulatory influence on the variable Y is calculated as the boolean function F on the input variables I 1 , …, I n . If X 1 , …, X n are activators, then the state of the variable Y is F(I 1 , …, I n ); if X 1 , …, X n are inhibitors, the state of Y is 1 − F(I 1 , …, I n ). The boolean "interaction function" F defines in which way the intermediate effects I i , and indirectly the variables X i , interact. Here, we consider the interaction function 'OR'. The semantics of the 'OR'-function implies that the variables X i are each assumed to be sufficient to influence Y. In the present work we apply two models: the simple 'OR'-model with activatory regulation and the complex 'OR-NOR'-model with activatory and inhibitory regulation. In the complex model, the regulatory influences of multiple activators and multiple inhibitors are combined with 'AND'function as depicted in Figure 2.
Introduction of the hidden state variables I i allows to insert "noise" into the Boolean logic based models. It allows to model that the biological mechanism of the regulation of one gene by another could be inhibited for unknown reasons. Thus, the input variables can be considered as observables from which we make our noisy measurements, while the hidden variables have the "true" latent biological values.

Bayesian model selection
We employ the Bayesian methodology for learning the structure and parameters of the model from data. The Bayesian approach addresses the problem as calculating the posterior probability   One of the MCMC approaches is Gibbs sampling. Gibbs sampling reduces the problem of dealing simultaneously with a large number of unknown parameters in a joint distribution into a simpler problem of dealing with one variable at a time, iteratively sampling each from its full conditional distribution given the current values of all other variables in the model.
The software OpenBUGS (BUGS stands for Bayesian Updating with Gibbs Sampling) is the general purpose software for Gibbs sampling on graphical models (23). OpenBUGS provides a declarative language for specifying a graphical model, i.e. the specification of the model likelihood and of the prior distributions for all parameters is required. The output of Markov chain simulation is used to summarize the posterior distributions of the variables of interest. The present approach utilizes the Linux version of the software OpenBUGS.
Our problem of model selection is formulated as follows: given the data on the gene Y and its potential regulators X 1 , …, X p , for a given boolean logic function F, identify the subset X 1 , …, X n of actual regulators of Y. We substitute the model indicator m ∈M with the variable indicator γ = (γ 1 , …, γ p ), a binary vector, representing which of the X j , j = 1, …, p should be included in the desirable "true" model. This allows the consideration of one joint space of the model parameters and the variable indicator, keeping the dimensionality constant across all possible models. By introducing the variable indicator, the "OR" model may be written as: where Y and X j is the observed data and θ and γ are the parameters. This represents the specification of the model likelihood.
The Bayesian approach requires specification of prior distributions for the model parameters. We defined the priors for the parameters θ j with Beta distribution, since Beta distribution constrains the parameters to the [0,1]-interval. We use a hierarchical formulation of the distribution, i.e. with hyperparameters a j and b j : The hyperparameters a j and b j are defined in two different ways, dependently on the parameters γ j . If γ j = 1, they are defined equal to 1, therefore making the prior non-informative (Beta(1,1)). In the case γ j = 0, the parameters are called (pseudopriors). The pseudopriors may be chosen in a way to help increasing the efficiency of the sampling procedure. Efficient performance can be achieved when the moves of the MCMC chain between different models γ could be "local". Therefore, the so called proposal densities for the pseudopriors can be used, which are being estimated by a pilot run of the MCMC for the saturated model, i.e. the model where all terms γ j = 1 for all j. Thus, we calculate the hyperparameters a j and b j by the formulas (method of moments): 1 a an j ), where mean j and var j , the mean and the variance of the parameters θ j , are estimated from the pilot run of the saturated model.
Next, one must define the prior distribution for the variable indicator γ. Since the terms γ j are independent, the prior can be decomposed into independent Bernoulli distributions for each term: γ j ∼ Bernoulli( j ), where j is the prior probability to include term j into the model. The prior j = 0.5 is non-informative in the sense of favoring all models equally, but is not non-informative with respect to the model size. To favor more parsimonious models (i.e. with small number of actual regulators), we used j = 0.2.
The runs of the MCMC were summarized and monitored for convergence using R-package CODA (http://cran.r-project.org).
We obtained the Markov chain samples on the parameters γ j and θ j . For the 'OR'-model, we used 20000 iterations of the Markov chain for the burnin time and 50000 iterations for the parameter estimations. For the complex 'OR-NOR'-model, we used 30000 iterations for the burn-in, and 100000 iterations for estimations. If the frequency of 1s in the chain for γ j exceeded 0.7, we assumed that γ j = 1and the respective regulator should be included in the "true" model. Otherwise, the regulator j should be excluded.
We apply learning the 'OR' and 'OR-NOR' models from data for each gene, considering all other genes in the dataset as candidate regulators. Hence, in our approach, bidirectional regulations might be inferred i.e. the gene X might be deduced to regulate Y, and the gene Y might be deduced to regulate X. Thus, cycles can be present in the global picture of genes regulation. This is opposite to other approaches based on graphical models such as e.g. Bayesian networks, where the learning of all interactions at once, using global criteria, is executed. The "local" model learning, presented in this paper, is of greater advantage, since it is capable of capturing the biological reality (feedback regulation etc.) more adequately.

Model checking
After the execution of the MCMC sampling and the estimation of the variable indicator γ, the check of goodness-of-fit of the model to data is required, to check whether the model assumptions were appropriate. Bayesian model checking uses the posterior predictive distributions (10). The goal is to perform posterior predictions under the model and to assess the discrepancy between predicted and observed data. If the model is reasonably accurate, the predicted data should be similar to the observed data.
Here, we wish to check the ability of the concrete regulatory model, defined by the inferred vector γ, to predict the state of the gene Y from the states of its regulators.
In the present framework, the posterior predictions can be computed by simulation: in every MCMC loop, a replicate y rep is generated conditionally on the currently generated parameters θ. (Note that here fixed binary values for γ, estimated previously by the model learning, are used.) Based on the MCMC simulations, the estimate E(y rep ) of the replicate can be made. The replicate is generated for each i = 1, …, N, where N is the number of samples in the Y dataset. Then, the individual observations of Y in the dataset y i , i = 1, …, N must be compared to the replicate data. For the comparison, we use the residual function Observations, for which the residual is not close to 0, indicate some lack-of-fit of the model and should be regarded as outlier. We regarded the residual as not close to 0, if its absolute value exceeded one estimated var(y i rep ). We calculate the model prediction accuracy as the percentage of non-outliers.

Types of regulatory situations considered
Our data is the time-series data, i.e. we have measurements of genes at subsequent time points t = 1, …, T,where T = 53. The true biological time resolution of the gene transcription and activation is yet unknown. It can be assumed, that a gene is active at the same time point as its activators, or, that it becomes active at the next time point. We refer to the first situation as 'simultaneous' regulation, and to the second as 'time delay' regulation. Both situations are considered in the present paper.
We treat gene measurements at each time point as statistical data samples. In the case of 'simultaneous' regulation, the state of a gene in the sample t depends on the states of its regulators in the same sample. Here, we have 53 data samples. In the case of 'time delay' regulation, the state of a gene in the sample t depends on the states of its regulators in the sample t − 1, so we have 52 data samples.

Data discretization
For the discretization of the continuous gene expression values into two states (0 -not active, 1 -active) we used a vector quantization technique based on the clustering algorithm k-means. For each gene we clustered its expression values into two groups by the k-means algorithm with two initial values: 0 and the maximum expression value of the gene.

Glycolysis pathway
From the public database PlasmoDB (http://plasmodb.org), we harvested twenty genes that are known to be involved in the glycolysis pathway. The hypothetical functions for each of these genes are presented in the Table 1.
We found the time-resolved gene expression profiles of the eighteen of these genes in the dataset of (5). We averaged the values from multiple oligonucleotides representing same gene. Then we discretize the continuous gene expression values into binary values 1 and 0 representing that the gene is active and not active.
We applied our approach for learning the 'OR'and 'OR-NOR'-models from data. Two models have different semantics and can bring slightly different results. Learning the "OR"-model identifies only the activators of a gene, i.e. the model "explains" the non-activity of the gene with the failure of its activators. In the "OR-NOR"-model, the non-activity of the regulatee is also "explained" by the activity of its inhibitors. We applied model learning for each gene in the dataset, considering all other genes as candidate regulators. We have considered both 'simultaneous' and 'time delay' situations. The results of our 'simultaneous' and 'time delay' learning of the 'OR-NOR'-model for 18 genes of the glycolysis pathways are summarized in the Tables 2 and 3, and represented graphically in the Figures 3 and 4. The graphs were generated with the program GraphViz (www. graphviz.org). The results of the 'simultaneous' and 'time delay' learning of the 'OR'-model are displayed in supplementary Tables 1 and 2 and supplementary Figures 1 and 2.
The 'OR-NOR'-learning mostly supported the results of the 'OR'-model but found some more activators and inhibitors with increased accuracy. The regulatory network in Figure 3 reveals the strategic position and hence the key regulatory role of the genes PF11_0157, PFD0660w, PF14_0341 and PF13_0141. The inhibitory connections between the genes PFD0660w and PFL0780w, from the gene PFD0660w to the gene PFI0755c, and between the genes PF14_0425 and PF13_0144 might indicate three groups of genes working in timely separated manner. One group include the genes PF11_0157, PF13_0144, PF11_0294, PF13_ 0269 and PFD0660w. The second group  (8), the second group functionality or connectivity was overwhelmingly confirmed except for genes PF13 _0141 and PF14_0378. We will suggest that a further biological studies should be carried out to check our prediction here. Furthermore, in group three, using this tool, we are able to show that genes PF11_0338 and PFL0780w, PFL0780w, PF11_0338 and PFC0831w and PFC0831w and PFL0780w are functionally connected. This tools could not verify the functional connectivity of PF11_0208 and PF14_0425 as we have shown theoretically. Based on the correctness of our predictions so far, we will suggest that these and group one connectivities (including their regulatory modalities) as shown here should also be tested biologically. The genes PFD0660w, PF14_0341, PF11_0338, and PF14_0425 present interesting crosspoints between the separate groups. In the recent review (24) that cataloguizes the various drug targets of P.f., three genes PF14_0341, PF13_0141 and PF14 _0425 which encode three important energy metabolites, namely enzymes EC 5.3.1.9 (glycose-6 phosphate isomerase), EC 1.1.1.27 (lactate dehydrogenase) and EC 4.1.2.13 (aldolase) were stated as possible drug targets genes. Our theoretical finding predicts the important regulatory role of these genes in the glycolysis. The regulatory interactions of these genes to others reconstructed by our learning procedure should be verified in biological studies. Furthermore, biological literature supports our prediction that the gene PF14 _0598 is been activated by PF14_0378 (20).
The 'time delay' regulatory network (see Table  3 and Fig. 4) suggests the key regulatory role of the genes PF11_0157, PF11_0208, PF14_0341 and PF10_0155. The graph in Figure 4 also reveals the groups of closely connected genes. Interestingly, the gene PF10_0155 is connected to both enzyme genes PF14_0341 and PF13_0141. It was shown experimentally that the gene PF13_0269 is been activated by PF11_0157(22) as we have predicted here. It is interesting to note that the metabolic pathway maps with enzymes for the P.f glycolysis pathway available at KEGG database supports our predicted interaction depicted in Figure 4. However, Figure 4 contains more informations and therefore can be used to update the KEGG database. Barrera et al. (1) applied their probabilistic genetic network approach to the gene expression profiles of ten enzymes from the glycolytic pathway of P.f. Note that apart from the target genes that coded for all the 10 enzymes pertaining to the glycolytic pathway, included in their analysis are also 40 best predictors for each glycolytic target (289 distinct oligos in total). The authors assumed the 'time delay' regulation. Their estimation procedure is based on the conditional entropy minimization to discover subsets of genes predicting the target gene at best. Their results provided a biologically meaningful list of genes with putatively similar functions, which are also obtained in (5), but it is not possible to study the modalities of the genes regulation with their approach. Furthermore, the authors worked on the level of oligonucleotides i.e. representing one gene with multiple oligonucleotides, thus inserting bias from the statistical point of view. Among the oligos for target genes that code for all the 10 enzymes pertaining to the glycolytic pathway, in their network, the oligos for the  genes PF14_0341 and PFI0755c are the only ones that shared direct contact. This is also shown to be so in our predictions of (Fig 3) in addition to the modality of their regulation. Khanin and Wit (13) derived their overall malaria gene network in the intraerythrocytic development cycle based on the 3048 genes. Note  (13) averaged the values from multiple oligonucleotides representing same gene. They claim that nine genes of the glycolysis pathway share five links among themselves and show that the probability of nine randomly picked genes having 5 links is 0.01% given the connectivity matrix. Apart from the fact that, their output provided no hints on the altitude of the genes during regulation, we found out that theses genes share more than five links.

Plastid genome
We considered the set of genes annotated to make up the plastid genome, otherwise known as the apicoplast. We harvested the oligonucleotides representing each of the 26 putative apicoplast genomeencoded proteins listed from the DeRisi's laboratory (malaria.ucsf.edu). In this case, we found in the 3D7 gene expression dataset (S01_ 3D7_Raw.txt), taking also from the DeRisi malaria site, that only one oligonucleotide was used to represent each putative protein encoded genes of the plastid genome.
The apicoplast can be thought as a cell living within another cell (here P.f) and contains all the familiar cellular processes such as DNA replication, transcription, translation, fatty acid synthesis, isopentenyl diphosphate synthesis, aromatic amino acid synthesis and the heme synthesis (19).
The results of our 'simultaneous' and 'time delay' learning of the 'OR-NOR'-model for 26 genes of the apicoplast are summarized in the Tables 5 and 6, and represented graphically in the Figures 5 and 6. The results of the 'simultaneous' and 'time delay' learning of the 'OR'-model are displayed in Supplementary Tables 3 and 4 and Supplementary Figures 3 and 4. The graphs for 'simultaneous' 'OR'-and 'OR-NOR'-regulation contain disconnected components suggesting the groups of closely related genes. One can see the strategical positioning of the genes ORF129, Clp, ORF91, tufA, PtRNA-Pro, and rpl16 indicating the key regulatory role of these genes. We discovered from the literature that the genes rpl23, tufA and rpl16 has been tested in wet experiments and have been marked as putative drug/herbicide targets (19). The observation of the 'time delay' regulatory interactions inferred further suggest the important role of Clp, rpl23, rpl16 but also of the genes rpl2, PtRNAGln and PtRNAThr. The gene rps19 was inferred to be regulated by many other genes. The graph of the 'time delay' 'OR-NOR'-regulation is much more connected than that of the 'simultaneous'. Table 4 shows the functional interconnection as derived in (Fig. 2) (B) of Barrera et al. (1). There are a lot of deviations, contactwise, in the contacts we predicted. Using the genes rpl23, tufA and rpl16 that has been marked as putative drug/herbicide targets in Ralph et al. (19), we noted that our derived networks in From all the information that we could gathered from PlasmoDB, DeRisi malaria site, KEGG database and the metabolic maps of the apicoplast from the work of Barrera et al. which we have encapsulated in Table 4, presently, nothing else is known about the metabolic maps of the apicoplast genes. So our work provides the first insight into the apicoplast genes regulatory connections.

Conclusion
In the present work we have elucidated regulatory interactions of the genes of P.f. in glycolysis pathway and apicoplast metabolism. Some of our predictions found support in the biological literature, others would be worthwhile to verify in the 'wet' laboratory. We have proposed the groups of closely connected genes (pathways) and crosspoints between them. Our approach is based on the Bayesian learning of the probabilistic graphical model explicitly representing the logic dependencies between a gene and its regulators. Our method allows for elucidating more complex multigene relations which go beyond pairwise relations retrieved by other approaches. Also, we derive sparse connections between the genes, which can be further interpreted and validated in the laboratory, as opposed to the previous authors (e.g. Khanin and Wit, 2004). An important advantage of the Bayesian approach is that it enables the inclusion of "subjective" prior information into the model. In this study we used the subjective prior specification to enforce the number of gene regulators to lie in the small range. Potentially, one could defi ne priors aiming to incorporate previous biological knowledge into the model learning.
We have identified a few genes (like PF11_0157, rpl23) which regulate many other genes (master activators). Further theoretical extension of our work would be to collect potentially commonly regulated genes, scanning their upstream regions for commonly present, conserved sequence motifs (for example, by means of the PhyME algorithm (21)), computing a probability weight matrix from such motifs and scanning the genome of P.f. for further genes that harbor these motifs in their upstream region. Our initial work in this workflow revealed a set of three motifs commonly contained in seven genes (PFL0780w, PF14_0425, PF13_ 0269, PF11_0294, PF13_0144, PFC0831w and PFD0660w) computationally predicted to be activated by PF11_0157 (Tachado et al. (22) has shown experimentally that PF13_0269 is been activated by PF11_0157). Restricting to high confidence presence of the motif, and requiring at least two of the three motifs to be present, a set of seven genes (PFL1160c, PFB0480w, PFI0260c, PFI1605w, MAL8P1.143, PF14_0363 and PF14_0472) were     Table 1. 'OR'-regulatory interactions of eighteen (18) genes in the glycolysis pathway, 'simultaneous' gene activities.

Genes
Activators Accuracy